1 Exploiting task-level concurrency in a programmable network interface
Hyong-youb Kim, Vijay S. Pai, Scott Rixner 

June 2003 ACM SIGPLAN Notices , Proceedings of the ninth ACM SIGPLAN symposium 
on Principles and practice of parallel programming, volume 38 issue 10 

Full text available: pdf(191.35 KB) Additional Information: full citation, abstract, references, citings, index terms



Programmable network interfaces provide the potential to extend the functionality of 
network services but lead to instruction processing overheads when compared to 
application-specific network interfaces. This paper aims to offset those performance 
disadvantages by exploiting task-level concurrency in the workload to parallelize the 
network interface firmware for a programmable controller with two processors. By carefully 
partitioning the handler procedures that process various events related to ... 



Keywords: ethernet, firmware, parallel programming, programmable network interface 



2 Maps: a compiler-managed memory system for raw machines 
Rajeev Barua, Walter Lee, Saman Amarasinghe, Anant Agarwal 

May 1999 ACM SIGARCH Computer Architecture News , Proceedings of the 26th 

annual international symposium on Computer architecture, volume 27 issue 2 

Full text available: pdf Additional Information: full citation, abstract, references, citings, index terms
Publisher Site

This paper describes Maps, a compiler managed memory system for Raw architectures. 
Traditional processors for sequential programs maintain the abstraction of a unified memory 
by using a single centralized memory system. This implementation leads to the infamous 
"Von Neumann bottleneck," with machine performance limited by the large memory latency 
and limited memory bandwidth. A Raw architecture addresses this problem by taking 
advantage of the rapidly increasing transistor budget to move much of ... 
Yuetsu Kodama, Hirohumi Sakane, Mitsuhisa Sato, Hayato Yamana, Shuichi Sakai, Yoshinori
Yamaguchi 

May 1995 ACM SIGARCH Computer Architecture News , Proceedings of the 22nd 

annual international symposium on Computer architecture, Volume 23 issue 2 

Full text available: pdf Additional Information: full citation, abstract, references, citings, index terms
Latency tolerance is essential in achieving high performance on parallel computers for
remote function calls and fine-grained remote memory accesses. EM-X supports 
interprocessor communication on an execution pipeline with small and simple packets. It can
create a packet in one cycle, and receive a packet from the network in the on-chip buffer 
without interruption. EM-X invokes threads on packet arrival, minimizing the overhead of 
thread switching. It can tolerate communication latency by using ... 



4 A survey of commercial parallel processors 

Edward Gehringer, Janne Abullarade, Michael H. Gulyn 

September 1988 ACM SIGARCH Computer Architecture News, volume 16 issue 4 
Full text available: pdf Additional Information: full citation, abstract, citings, index terms

This paper compares eight commercial parallel processors along several dimensions. The 
processors include four shared-bus multiprocessors (the Encore Multimax, the Sequent 
Balance system, the Alliant FX series, and the ELXSI System 6400) and four network 
multiprocessors (the BBN Butterfly, the NCUBE, the Intel iPSC/2, and the FPS T Series). The 
paper contrasts the computers from the standpoint of interconnection structures, memory 
configurations, and interprocessor communication. Also, the share ... 




5 The NYU Ultracomputer
(Extended Abstract) 

Allan Gottlieb, Ralph Grishman, Clyde P. Kruskal, Kevin P. McAuliffe, Larry Rudolph, Marc Snir 
April 1982 Proceedings of the 9th annual symposium on Computer Architecture 

Full text available: pdf(1.36 MB) Additional Information: full citation, abstract, references, citings, index terms
We present the design for the NYU Ultracomputer, a shared-memory MIMD parallel machine
composed of thousands of autonomous processing elements. This machine uses an 
enhanced message switching network with the geometry of an Omega-network to 
approximate the ideal behavior of Schwartz's paracomputer model of computation and to 
implement efficiently the important fetch-and-add synchronization primitive. We outline the 
hardware that would be required to build a 4096 processor system using 1990' ... 

6 The NYU Ultracomputer

Allan Gottlieb, Ralph Grishman, Clyde P. Kruskal, Kevin P. McAuliffe, Larry Rudolph, Marc Snir 
August 1998 25 years of the international symposia on Computer architecture 
(selected papers) 

Full text available: pdf Additional Information: full citation, references, index terms



7 Embedded applications: AES and the cryptonite crypto processor
Dino Oliva, Rainer Buchty, Nevin Heintze 

October 2003 Proceedings of the 2003 international conference on Compilers, 
architectures and synthesis for embedded systems 

Full text available: pdf(346.09 KB) Additional Information: full citation, abstract, references, index terms

CRYPTONITE is a programmable processor tailored to the needs of crypto algorithms. The 
design of CRYPTONITE was based on an in-depth application analysis in which standard 
crypto algorithms (AES, DES, MD5, SHA-1, etc) were distilled down to their core 
functionality. We describe this methodology and use AES as a central example. Starting with 
a functional description of AES, we give a high level account of how to implement AES 
efficiently in hardware, and present several novel optimizations (whic ... 

Keywords: AES, architecture, cryptography, high-bandwidth, high-speed, processor, round 
key generation, software implementation 
June 1988 Proceedings of the 2nd international conference on Supercomputing
Full text available: pdf(1.38 MB) Additional Information: full citation, abstract, references, citings, index terms

Performance of high-speed multiprocessor systems is limited by the available bandwidth to 
memory and the need to synchronize write sharable data. This paper presents a new 
memory system that separates synchronization related data from others. The memory 
system has two tiers: synchronization memory and high bandwidth (HB) memory. The 
synchronization memory consists of snooping caches connected to a bus and is used to store 
synchronization variables such as locks and semaphores. The H ... 

9 MULTILISP: a language for concurrent symbolic computation
Robert H. Halstead 

October 1985 ACM Transactions on Programming Languages and Systems (TOPLAS), 

Volume 7 Issue 4 

Full text available: pdf Additional Information: full citation, abstract, references, citings, index terms, review

Multilisp is a version of the Lisp dialect Scheme extended with constructs for parallel 
execution. Like Scheme, Multilisp is oriented toward symbolic computation. Unlike some 
parallel programming languages, Multilisp incorporates constructs for causing side effects 
and for explicitly introducing parallelism. The potential complexity of dealing with side 
effects in a parallel context is mitigated by the nature of the parallelism constructs and by 
support for abstract data types: a recommende ... 
Philip C. Treleaven, David R. Brownbridge, Richard P. Hopkins
January 1982 ACM Computing Surveys (CSUR), volume 14 issue 1

Full text available: pdf(4.14 MB) Additional Information: full citation, references, citings, index terms



11 Balancing performance

Ilija Hadzic, Jonathan M. Smith 

November 2003 ACM Transactions on Computer Systems (TOCS), volume 21 issue 4 
Full text available: pdf(719.03 KB) Additional Information: full citation, abstract, references, index terms

The goals of performance and flexibility are often at odds in the design of network systems. 
The tension is common enough to justify an architectural solution, rather than a set of 
context-specific solutions. The Programmable Protocol Processing Pipeline (P4) design uses 
programmable hardware to selectively accelerate protocol processing functions. A set of 
field-programmable gate arrays (FPGAs) and an associated library of network processing 
modules implemented in hardware are augmented with so ... 

Keywords: FPGA, P4, computer networking, flexibility, hardware, performance, 
programmable logic devices, programmable networks, protocol processing 



12 Resource

Michael Upton, Thomas Huff, Trevor Mudge, Richard Brown 

November 1994 Proceedings of the sixth international conference on Architectural 

support for programming languages and operating systems, volume 29 , 

28 Issue 11 , 5 

Full text available: pdf
This paper discusses the design of a high clock rate (300MHz) processor. The architecture is
described, and the goals for the design are explained. The performance of three processor 
models is evaluated using trace-driven simulation. A cost model is used to estimate the 
resources required to build processors with varying sizes of on-chip memories, in both single 
and dual issue models. Recommendations are then made to increase the effectiveness of 
each of the models. 
Keywords: decoupled architecture, floating point latencies, nonblocking cache, pipelining,
prefetching, resource allocation, superscalar 
Jin Lin, Tong Chen, Wei-Chung Hsu, Pen-Chung Yew

March 2003 Proceedings of the international symposium on Code generation and 
optimization: feedback-directed and runtime optimization 

Full text available: pdf Additional Information: full citation, abstract, references, citings, index terms
Publisher Site

The pervasive use of pointers with complicated patterns in C programs often constrains 
compiler alias analysis to yield conservative register allocation and promotion. Speculative 
register promotion with hardware support has the potential to more aggressively promote 
memory references into registers in the presence of aliases. This paper studies the use of 
the Advanced Load Address Table (ALAT), a data speculation feature defined in the IA-64 
architecture, for speculative register promotion. An ... 

14 Fine-grained mobility in the Emerald system 
Eric Jul, Henry Levy, Norman Hutchinson, Andrew Black 

February 1988 ACM Transactions on Computer Systems (TOCS), volume 6 issue 1

Full text available: pdf Additional Information: full citation, abstract, references, citings, index terms, review

Emerald is an object-based language and system designed for the construction of distributed 
programs. An explicit goal of Emerald is support for object mobility; objects in Emerald can 
freely move within the system to take advantage of distribution and dynamically changing 
environments. We say that Emerald has fine-grained mobility because Emerald objects can 
be small data objects as well as process objects. Fine-grained mobility allows us to apply 
mobility in new ways but presents implemen ... 

15 Specialized merge processor networks for
Lee A. Hollaar 

September 1978 ACM Transactions on Database Systems (TODS), volume 3 issue 3 

Full text available: pdf Additional Information: full citation, abstract, references, citings, index terms
In inverted file database systems, index lists consisting of pointers to items within the
database are combined to form a list of items which potentially satisfy a user's query. This 
list merging is similar to the common data processing operation of combining two or more 
sorted input files to form a sorted output file, and generally represents a large percentage of 
the computer time used by the retrieval system. Unfortunately, a general purpose digital 
computer is better suited for complica ... 

Keywords: backend processors, binary tree networks, computer system architecture, full 
text retrieval systems, inverted file databases, nonnumeric processing, pipelined networks, 
sorted list merging 
Tse-yun Feng, Chuan-lin Wu, Dharma P. Agrawal

April 1979 Proceedings of the 6th annual symposium on Computer architecture 
Full text available: pdf(839.01 KB)
This paper describes an asynchronous circuit switching network for multiple-processor
systems. Several circuit switching networks for various applications have been proposed and 
constructed. However, there are problems associated with these networks. The 
asynchronous circuit switching network possesses several features that can solve these 
problems. A three-stage fully connected topology is utilized to construct the network. Each
switching element is functionally and physically identical an ... 

17 Speculative synchronization: applying thread-level speculation to explicitly parallel
applications

Jose F. Martinez, Josep Torrellas 

October 2002 Proceedings of the 10th international conference on Architectural 

support for programming languages and operating systems, volume 36 , 30 , 

37 Issue 5 , 5 , 10 

Full text available: pdf Additional Information: full citation, abstract, references, citings

Barriers, locks, and flags are synchronizing operations widely used by programmers and
parallelizing compilers to produce race-free parallel programs. Often times, these operations 
are placed suboptimally, either because of conservative assumptions about the program, or 
merely for code simplicity. We propose Speculative Synchronization, which applies the 
philosophy behind Thread-Level Speculation (TLS) to explicitly parallel applications. 
Speculative threads execute past active barriers, busy ... 

18 A preliminary architecture for
Jack B. Dennis, David P. Misunas 

December 1974 ACM SIGARCH Computer Architecture News , Proceedings of the 2nd 

annual symposium on Computer architecture, volume 3 issue 4 
Full text available: pdf(675.54 KB) Additional Information: full citation, abstract, references, citings

A processor is described which can achieve highly parallel execution of programs 
represented in data-flow form. The language implemented incorporates conditional and 
iteration mechanisms, and the processor is a step toward a practical data-flow processor for 
a Fortran-level data-flow language. The processor has a unique architecture which avoids 
the problems of processor switching and memory/processor interconnection that usually limit
the degree of realizable concurrent processing. The architect ...

19 The KScalar simulator

J. C. Moure, Dolores I. Rexachs, Emilio Luque 

March 2002 Journal on Educational Resources in Computing (JERIC), volume 2 issue 1

Full text available: pdf(493.35 KB) Additional Information: full citation, abstract, references, index terms

Modern processors increase their performance with complex microarchitectural mechanisms, 
which makes them more and more difficult to understand and evaluate. KScalar is a
graphical simulation tool that facilitates the study of such processors. It allows students to 
analyze the performance behavior of a wide range of processor microarchitectures: from a 
very simple in-order, scalar pipeline, to a detailed out-of-order, superscalar pipeline with 
non-blocking caches, speculative execution, and comp ... 

Keywords: Education, pipelined processor simulator 



20 System architectures
John W. Gordon 

June 1985 ACM Computing Surveys (CSUR), volume 17 issue 2 

Full text available: pdf(4.61 MB) Additional Information: full citation, abstract, references, index terms

Computer music is a relatively new field. While a large proportion of the public is aware of 
computer music in one form or another, there seems to be a need for a better 
understanding of its capabilities and limitations in terms of synthesis, performance, and 
recording hardware. This article addresses that need by surveying and discussing the 
architecture of existing computer music systems. System requirements vary according to 
what the system will be used for. Common uses for co ... 